ABSTRACT


A partially observable Markov process is a mathematical model
of a dynamic probabilistic system which consists of an underlying
Markov process obscured from direct observation by imperfect output
channels. The observed output R(t) is stochastically related to
the underlying state S(t). This model, like the Markov model, is
applicable in the analysis of a wide range of sequential decision
problems.

The primary area of investigation in this report is the selection
of a course of action from a set of alternatives using only the
information about the system which is available from the observable
outputs. Associated with the model is a cost structure. The
decision-maker may use the observed outputs to make inferences about
the underlying Markov state and will be assessed rewards or penalties
depending on the true state of nature and on the action taken.

The state of knowledge vector s(t) summarizes all that is known
about the probability of the system being in each of the underlying
states as a function of the observed outputs. The optimal policy
will specify a course of action to be taken for each possible state
of knowledge s(t) for all possible t. The policy depends on the
decision-maker's knowledge of the underlying Markov state, on the
cost structure associated with the model, and on the criterion of
optimum used.

Dynamic  programming  techniques  are  shown  to  be  of  use  in  the 
optimization  of  both  transient  and  steady  state  policies.  The 
analysis  is  conducted  with  the  optional  availability  of  a  perfect 
information  channel  at  added  cost.  Computer  programs  were  written 
for  policy  evaluation  and  optimization,  and  specific  numerical 
results  are  included  in  this  report. 
CHAPTER I


INTRODUCTION 

One of the basic problems in science and engineering
is the construction of models whose mathematical behavior
will approximate the physical behavior of real world systems.
In the analysis of certain types of nondeterministic
systems, the Markov model has shown itself to be a very
useful tool. The "partially observable" Markov model is
an extension which takes into account the effect of imperfect
observations of the state of the dynamic system.

The concept of "state" is central to modelling. The
condition or state of a system may be specified by giving
the values of relevant parameters. For example, the state
of a gas may be specified by giving its temperature, pressure,
and the enclosing volume. The state of a highway toll station
may be specified at any given instant by the number of
collection booths operating and the number of vehicles in each
queue. As time progresses, the parameters vary and the
system changes state, thereby exhibiting dynamic behavior.

The most general probabilistic system would have the
parameters taking a continuous allowable range of values, and
would allow the parameters to change at any instant in time.
This would require a continuous state and continuous time
probabilistic model to describe the system. If S(t) is the
state of the system at time t, in general S(t) will depend
on the entire history of the system previous to time t.
Thus a statistical description of the future of the system
will in general depend on both the present state at time t
and the complete history of the system previous to time t.

1.1  The  Markov  Process 

If only knowledge of the present state, and not the
entire history, is necessary to allow statistical description
of the future of the stochastic system, the process is
Markovian. Although this is a severely restrictive assumption,
in actual fact many real world systems may be accurately
modelled as Markov processes. A few prominent areas of
Markov process application are marketing, inventory control,
traffic, quality control, equipment replacement, routing,
and portfolio investment.

To illustrate what the Markovian assumption entails,
consider the following example:

A housewife buys groceries at the same store once every
week. The store carries two brands of milk, A and B. The
state of the system in a given week is the brand of
milk she bought that week. The present week is time n,
and the probability that she buys brand A at week n+1, given
her history of purchases, is:

$$P[\,s(n+1)=A \mid s(n)=i,\ s(n-1)=j,\ \ldots,\ s(0)=m\,]$$

where i, j, ..., m are each either A or B, depending on which
brand she bought that week.

The Markovian assumption states that the above probability
depends only on which brand she purchased this week:

$$P[\,s(n+1)=A \mid s(n)=i,\ \ldots,\ s(0)=m\,] = P[\,s(n+1)=A \mid s(n)=i\,] = p_{iA}(n)$$

Figure 1  Markov Marketing Model

The transition probability, $p_{ij}(n)$, is the probability
that the state at time n will be j if the state at time n-1
was i. The system is called time invariant if $p_{ij}(n) = p_{ij}$,
independent of n. For a discrete state and time invariant
model with N states, $N^2$ transition probabilities would be
required, not all of which are independent: each row of the
transition matrix must sum to one, which leaves N(N-1)
independent values.
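The transition probabilities just described can be collected in a matrix. The following is a minimal sketch for the two-brand example; the numerical values, the names `P`, `is_stochastic`, `independent_entries` and `step`, and the zero-based indexing (index 0 for brand A, index 1 for brand B) are illustrative assumptions and do not come from the report.

```python
# Hypothetical time-invariant transition probabilities for the two-brand
# example. The numbers are illustrative only.
P = [
    [0.8, 0.2],  # bought A this week: P[A next week], P[B next week]
    [0.3, 0.7],  # bought B this week: P[A next week], P[B next week]
]


def is_stochastic(matrix, tol=1e-12):
    """Check that every row is a probability distribution over next states."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def independent_entries(n_states):
    """N*N entries in total, minus one row-sum constraint per row."""
    return n_states * (n_states - 1)


def step(distribution, matrix):
    """Propagate a probability distribution over states one step forward."""
    n = len(matrix)
    return [
        sum(distribution[i] * matrix[i][j] for i in range(n))
        for j in range(n)
    ]


if __name__ == "__main__":
    print(is_stochastic(P))
    print(independent_entries(len(P)))
    print(step([1.0, 0.0], P))  # distribution one week after buying A
```

Because the model is time invariant, the same matrix is reused at every step.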

1.2  The  Partially  Observable  Process 

A partially observable Markov process is one which
must be observed through an imperfect output channel. Some
examples of imperfect channels are: an imperfect meter, the
atmosphere carrying in a signal from outer space, and the
incomplete inspection of a manufactured product.

Figure 2  Model of a Partially Observable Markov Process

The model consists of an underlying Markov process
which, depending on its true state, supplies values of
parameters to the output channels. The imperfect channels
operate on the input from the Markov process and yield
outputs which in most cases do not allow the observer to
ascertain the exact underlying Markov state. In fact, the
number of output readings, a, may not even equal the number
of true states, k.

The imperfect channel, like the underlying Markov
process, is a stochastic process and can be described by
the probabilities $f_{ij}(t)$, each of which is the probability of
output j at time t given that the true state was i.

Figure 3  A Two State Model Showing Output Channels
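The model described above, an underlying Markov process read through a stochastic output channel, can be simulated in a few lines. This is only a sketch under stated assumptions: the function names `sample` and `simulate`, the seeded random generator, and the convention that a transition occurs before each reading are hypothetical choices, and states and outputs are indexed from zero.

```python
import random


def sample(distribution, rng):
    """Draw an index according to a discrete probability distribution."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(distribution):
        cumulative += p
        if u < cumulative:
            return index
    return len(distribution) - 1  # guard against rounding error


def simulate(P, F, initial_state, steps, seed=0):
    """Return a list of (true state, observed output) pairs.

    P[i][j] is the probability of moving from state i to state j.
    F[i][j] is the probability of output j given true state i; F need
    not be square, since the number of outputs may differ from the
    number of states.
    """
    rng = random.Random(seed)
    state = initial_state
    history = []
    for _ in range(steps):
        state = sample(P[state], rng)   # hidden transition
        output = sample(F[state], rng)  # imperfect reading of the state
        history.append((state, output))
    return history


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.3, 0.7]]    # illustrative values
    F = [[0.8, 0.2], [0.25, 0.75]]  # illustrative values
    for state, output in simulate(P, F, initial_state=0, steps=5):
        print(state, output)
```

Only the second element of each pair would be visible to the observer; the first is the quantity he is trying to infer.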

1.3  Reward Structure

Associated with the real life system are decisions and
rewards. For example, a decision could be made as to the
true Markov state at time t. Various rewards can be defined:

$L_{ii}$ : reward if the true state is i and the observer
estimates that it is i.

$L_{ij}$ : reward if the true state is j and the observer
estimates that it is i.

These rewards form the basis for evaluating the effect
which a given decision might have.

1.4  Dynamic  Inference 

The partially observable Markov process is one of a
rather large class of systems which consist of one stochastic
process monitored through a second stochastic process.

Figure 4  More General "Partially Observable" Stochastic Process Model

In this more general model, process 1 supplies
statistical parameters to process 2, which operates on
them before presenting observable parameters. Information
on variations in the parameters of process 1 must be gleaned
from the pattern of the observable parameters output from
process 2. The information about process 1 obtained in
this manner is then used in decision-making and to predict
future developments. The general problem associated with
obtaining information about the underlying process is known
as "dynamic inference."

Applications of partially observable Markov processes
may be found in many areas. For example, the true value of
common stocks could be the underlying state with current
Wall Street price quotations as the "imperfect" output
variable. One might consider the quality of a manufactured
product as being the underlying state with results of
incomplete inspection supplying the "imperfect" output.
Another example, from the marketing area, might consist of
a customer's brand preference as the underlying state, and
his latest purchase as the imperfect indicator.

1.5  Previous Investigations

Work has recently been done on various aspects of partially
observable Markov processes by Drake, Kramer, and
Stoopes. Drake and Kramer discussed formulation of the
basic model and considered formation of the S vector, or
statistical state of knowledge vector, which in essence
summarizes all that is known about the probabilities of the
underlying Markov process being in each state at time t.

They considered methods of updating the S vector as new
data is received. Drake further considered various decoding
schemes on the observed outputs and related errors, as well
as information flow and associated costs on simple two
state symmetric models.

Stoopes' main investigation was in extending a betting
policy formulated by Kelly⁷ which entailed betting on various
input states a fraction of one's capital proportional
to the level of confidence about those input states.

1.6  Statement  of  Problem 

This investigation concerns the optimization of policies
associated with physical systems which can be modelled as
"partially observable Markov processes." The underlying
Markov process can only be observed through a stochastic
output channel. Therefore, the future effect of decisions
made utilizing this "imperfect channel" information cannot
be stated exactly. Since the observer is dealing with imperfect
data, he can only say with probability $P_A$ that the
effect of a given decision will be A and with probability
$P_B$ that the effect of the same decision will be B. This
complicates the decision process.

Associated with the selection of a course of action
from a set of alternatives is a cost structure. The
decision-maker may use the observed outputs to make inferences about
the underlying Markov state and will be assessed rewards or
penalties depending on the true state of nature and on the
action taken. The "optimum" policy will depend on the
decision-maker's knowledge about the underlying Markov state,
on the cost structure associated with the model, and on the
criterion of optimum used. There are several possible
criteria of a "good" decision. The decision may simply
be made so as to maximize the expected value of the reward,
or the observer may wish to impose a ceiling on allowed
risk and maximize his expected reward while never risking
a loss of more than that ceiling. Alternatively, some utility
function may be imposed upon the rewards and the policy
chosen to maximize the expected utility of rewards. In
Drake's work, a brief introduction to the above problem
is found for a symmetric two state example. This report
is a continuation and extension of that introduction.

The major mathematical techniques used for policy
optimization are those of dynamic programming, which are
covered extensively in Bellman, and in Bellman and Dreyfus.

A dynamic programming algorithm for optimization of regular
Markov processes was developed by Howard, and extensive
work was done in the same area by Schweitzer. In this
report, application is found for those techniques in the
area of partially observable Markov processes.

In many practical situations there exists a way to get
nearly perfect information about the underlying Markov
process, for a price. Therefore the analysis is conducted
with the optional availability of a perfect information
channel at an additional cost.

A two state Markov process monitored at discrete time
intervals by a binary channel will be used to exemplify the
ideas presented in this report.

One might give this a physical interpretation from the
communications area. Consider that a communications satellite
has been placed in orbit and is being used to convey
transoceanic messages. Unfortunately, because of various
interference sources, the satellite may not receive and
retransmit an intelligible signal. Therefore the designers
built into the satellite a check, whereby the quality of
the received message at the satellite is monitored. Then
binary data is transmitted back to the sender at discrete
time intervals telling him whether the received message
met or did not meet preset standards of quality.

Assume that it has been determined that the process
governing whether or not the satellite receives an acceptable
signal is essentially Markovian with time invariant
state transition probabilities. The binary signal the monitor
returns to the sender is also affected by the interference
and is therefore not fully reliable, but the conditional
probability distribution of outputs is known. The following
partially observable Markov process model is constructed by
the decision-maker.

[Diagram: Markov process generating binary data, with outputs from the imperfect channel]

state 1: acceptable message received at satellite
state 2: unacceptable message received at satellite

Figure 5  Communications Example

The  decision-maker  can  now  use  this  model  as  an  aid 
in  the  evaluation  of  various  policies,  or  courses  of  action. 
Using  the  binary  output  data,  inferences  can  be  made  about 
the  signal  quality  at  the  satellite.  The  knowledge  about 
signal  quality  can  then  be  used  along  with  the  cost  struc¬ 
ture  to  evaluate  the  expected  consequences  of  various  courses 
of action.

Various alternatives might be available to the decision-maker.
He might continue regular transmission, resend a
portion of the message, discontinue transmission for one or
more time units, conduct additional tests of signal quality,
or build a new and different communications system.

Markovian models have shown themselves to be very useful
in the past. The techniques developed in this report
allow the extension of analytical methods for optimal
decision making to include the case of the Markov process
being "obscured" by an imperfect information channel.


CHAPTER II


OPTIMAL TIME-DEPENDENT POLICIES

When making a decision, an extremely useful quantity
to know is the total reward or cost that can be expected
as a consequence of the particular decision made. Dynamic
programming allows the calculation of future expected utility
of rewards as a function of policy in sequential decision
problems, and therefore allows the selection of a decision
to maximize total expected utility of rewards.

In sequential decision problems, decisions may be made
at certain points in time and each decision will, in general,
carry with it implications which extend far into the future
and affect decisions as yet unmade. Likewise, what the
policy-maker intends to do in the future will affect his
present decision.

There are two basic techniques in dynamic programming.
These involve solving a problem in either "value space" or
"policy space." In this chapter the "value space" technique
is explained and is applied to partially observable
Markov processes. In Chapter III the "policy space" technique
will be employed in the determination of optimum policies.

Consider the two state communications example from
Chapter I where the observer periodically receives information
on the reception quality at the satellite.

Figure 6  Communications Example

Markov state 1: message at satellite meets preset standards
Markov state 2: message at satellite fails to meet preset standards
Output 1: ground observer receives signal "state 1"
Output 2: ground observer receives signal "state 2"


The  ground  observer  now  has  to  make  a  decision  on  the 
basis  of  the  stochastically  inaccurate  output  signals. 

Assume  first  that  he  has  only  two  options  open: 

1)  Continue  transmitting  until  next  output  is  received 

2)  Stop  transmitting  and  check  again  one  time  unit 
later 

This problem will be first approached in the time-dependent
case. That is, the observer doesn't have
unlimited access to the use of the satellite but must definitely
quit at some time n units into the future. It will
be convenient to measure one unit of time as the time between
output signals. When dynamic programming techniques are
applied in the analysis of processes which will terminate
at some specific time in the future, it is conventional to
call the termination time zero and measure time in reverse
of normal order. Thus, in this example, the current time
is n, and the process must terminate at time zero, which is
n time units into the future. The time independent policy,
where the physical process terminates far into the future or
continues indefinitely, will be considered in Chapter III.

Inherent in the decision problem is a reward structure.

$L_{11}$ : utility of reward if he continues transmitting
and the underlying Markov state is 1

$L_{22}$ : utility of reward if he stops transmitting
and the underlying Markov state is 2

$L_{12}$ : utility of reward if he continues transmitting
and the underlying Markov state is 2

$L_{21}$ : utility of reward if he stops transmitting
and the underlying Markov state is 1

The preceding problem uses a two state process with
two options allowed at each decision point. That type
of problem will now be solved in general.

2.1  Notation 

In computations to come the following shorthand notation
will be useful:

S(n)=x : underlying Markov state at time n is x
R(n) : output response at time n

To summarize the decision-maker's knowledge at
time n:

$$s_x(n) = P[\,S(n)=x \mid R(n),\ R(n+1),\ R(n+2),\ \ldots\,]$$

= probability that the underlying Markov
state is x at time n, given the past
history of observed outputs.

For an N state process, a "state of knowledge" vector
is defined:

$$s(n) = [\,s_1(n),\ s_2(n),\ \ldots,\ s_N(n)\,]$$

The "state of knowledge" vector summarizes
all that the observer knows about the process at
time n. For a two state process, $s_1(n)$ is sufficient
to determine the s(n) vector because it is
known that the underlying Markov state is either 1
or 2 and therefore $s_2(n)$ can be found from $s_1(n)$:

$$s_2(n) = 1 - s_1(n)$$

$$P_x[s(n)] = P[\,R(n-1)=x \mid s(n)\,]$$

= probability that the next output is x
given the current state of knowledge
The state of knowledge vector must be updated as new
information is received:

$T_x[s(n)]$ = updated state of knowledge at time n-1
given that the output at time n-1 was x.
The new state of knowledge is a function
of the old s(n), and the decision-maker
sees R(n-1) before he must update s(n).

The probability distribution on the next output to be
received will prove useful. The state of knowledge vector
gives a probability distribution on the underlying Markov
state, and if the underlying Markov state were known to be
i, the probability of output j would be $f_{ij}$. For the two
state case, the next output at time n-1 is predicted to be
1 or 2 with probabilities:

$$P_1[s(n)] = s_1(n)\,(p_{11}f_{11} + p_{12}f_{21}) + s_2(n)\,(p_{22}f_{21} + p_{21}f_{11})$$

$$P_2[s(n)] = s_1(n)\,(p_{11}f_{12} + p_{12}f_{22}) + s_2(n)\,(p_{22}f_{22} + p_{21}f_{12})$$

The above equations can be written in matrix form for
the two state case and then extended to the N state case.

Two state process:

P[s(n)] = row matrix of probabilities of next output reading

$$P[s(n)] = s(n)\,[P]\,[F] = [\,P_1[s(n)],\ P_2[s(n)]\,]$$

N state process:

$$[P] = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1N} \\ p_{21} & \cdot & & \cdot \\ \vdots & & \ddots & \vdots \\ p_{N1} & \cdots & & p_{NN} \end{bmatrix}, \qquad [F] = \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1N} \\ \vdots & & \ddots & \vdots \\ f_{N1} & \cdots & & f_{NN} \end{bmatrix}$$

$$P[s(n)] = s(n)\,[P]\,[F]$$

If $a^i$ is defined as the i-th column of matrix [A],
the components of the P[s(n)] vector are simply written:

$$P_i[s(n)] = s(n)\,[P]\,f^i = \text{a scalar}$$
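The matrix product s(n)[P][F] above is short enough to check numerically. The sketch below assumes hypothetical values for [P] and [F], uses plain lists rather than a matrix library, and indexes states and outputs from zero; the function names `vec_mat` and `output_probabilities` are illustrative.

```python
def vec_mat(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(n_cols)
    ]


def output_probabilities(s, P, F):
    """Row vector P[s(n)] = s(n) [P] [F] of next-output probabilities."""
    return vec_mat(vec_mat(s, P), F)


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.3, 0.7]]    # illustrative transition matrix
    F = [[0.8, 0.2], [0.25, 0.75]]  # illustrative channel matrix
    s = [0.5, 0.5]                  # current state of knowledge
    print(output_probabilities(s, P, F))
```

The result is a probability distribution over outputs, so its components sum to one whenever the rows of [P] and [F] do.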

2.3  Updating the State of Knowledge

The state of knowledge vector changes with time and it
will be necessary to update it as new information is received.
Recall that time is to be measured in reverse order. The
current time is n and the process must terminate at time
zero, which is n time units into the future. If the
decision-maker does not have the output readings available, his state
of knowledge vector would change with time as follows:

$\hat{s}(n-1)$ = new state of knowledge vector if the
decision-maker does not have the output readings
available.

For the two state process:

$$\hat{s}(n-1) = s(n)\,[P] = [\,\hat{s}_1(n-1),\ \hat{s}_2(n-1)\,]$$

$$\hat{s}_1(n-1) = s_1(n)\,p_{11} + s_2(n)\,p_{21}$$

$$\hat{s}_2(n-1) = s_1(n)\,p_{12} + s_2(n)\,p_{22}$$

For the N state process the matrix equation is the
same:

$$\hat{s}(n-1) = s(n)\,[P] = [\,\hat{s}_1(n-1),\ \hat{s}_2(n-1),\ \ldots,\ \hat{s}_N(n-1)\,]$$
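The update without output readings is a single vector-matrix product, and it can be applied repeatedly when several time units pass with no reading. This sketch assumes an illustrative [P]; the names `predict` and `predict_steps` are hypothetical.

```python
def predict(s, P):
    """One step of the no-observation update: s(n) [P]."""
    n = len(s)
    return [sum(s[i] * P[i][j] for i in range(n)) for j in range(n)]


def predict_steps(s, P, steps):
    """Apply the no-observation update for several consecutive time units."""
    for _ in range(steps):
        s = predict(s, P)
    return s


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.3, 0.7]]  # illustrative transition matrix
    print(predict([1.0, 0.0], P))           # one time unit, no reading
    print(predict_steps([1.0, 0.0], P, 5))  # five time units, no reading
```

For this particular matrix the repeated update drifts toward the distribution that is left unchanged by [P], which reflects the gradual loss of whatever the decision-maker knew at the start.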


Now consider giving the decision-maker the advantage
of seeing the output response R(n-1) before he must update
the state of knowledge vector. Given that output "1" has
been observed, the new state of knowledge vector can be
computed to be:

$$T_1[s(n)] = [\,s(n-1) \text{ given } R(n-1)=1\,]$$

Given output "1", the j-th component of the new state
of knowledge vector is:

$$T_{1,j}[s(n)] = P[\,S(n-1)=j \mid R(n-1)=1 \text{ and } s(n)\,] = \frac{P[\,S(n-1)=j \text{ and } R(n-1)=1 \mid s(n)\,]}{P[\,R(n-1)=1 \mid s(n)\,]}$$

$$= \frac{P[\,R(n-1)=1 \mid S(n-1)=j \text{ and } s(n)\,]\ P[\,S(n-1)=j \mid s(n)\,]}{P_1[s(n)]} = \frac{f_{j1}\,\hat{s}_j(n-1)}{P_1[s(n)]}$$

For a two state process, using the above relation:

$$T_{1,1}[s(n)] = \frac{[\,s_1(n)p_{11} + s_2(n)p_{21}\,]\,f_{11}}{P_1[s(n)]}$$

$$T_{1,2}[s(n)] = 1 - T_{1,1}[s(n)] = \frac{[\,s_1(n)p_{12} + s_2(n)p_{22}\,]\,f_{21}}{P_1[s(n)]}$$

$$T_{2,1}[s(n)] = \frac{[\,s_1(n)p_{11} + s_2(n)p_{21}\,]\,f_{12}}{P_2[s(n)]}$$

$$T_{2,2}[s(n)] = 1 - T_{2,1}[s(n)] = \frac{[\,s_1(n)p_{12} + s_2(n)p_{22}\,]\,f_{22}}{P_2[s(n)]}$$

The new state of knowledge vector s(n-1) will then
depend on what output is observed at time n-1.

If the output at time n-1 was 1:

$$s(n-1) = T_1[s(n)] = [\,T_{1,1}[s(n)],\ T_{1,2}[s(n)]\,]$$

If the output at time n-1 was 2:

$$s(n-1) = T_2[s(n)]$$

When the observer receives an output reading "1", he
must update his state of knowledge to $s(n-1) = T_1[s(n)]$. Even
if no output reading is available to the decision-maker,
his state of knowledge still changes: $s(n-1) = \hat{s}(n-1)$.
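The update rule derived above is an application of Bayes' rule to the predicted state distribution. The following sketch is one way to write it for an N state process; the function name `update`, the zero-based output index `x`, and the numerical values are assumptions made for illustration.

```python
def update(s, P, F, x):
    """Return (T_x[s(n)], P_x[s(n)]) for the observed output index x.

    s is the state of knowledge before the transition, P[i][j] the
    transition probabilities and F[j][x] the probability of output x
    when the true state is j.
    """
    n = len(s)
    # State distribution after the transition, before seeing the output.
    predicted = [sum(s[i] * P[i][j] for i in range(n)) for j in range(n)]
    # Joint probability of (state j, output x).
    joint = [predicted[j] * F[j][x] for j in range(n)]
    p_x = sum(joint)
    if p_x == 0.0:
        raise ValueError("output x has zero probability under s")
    return [value / p_x for value in joint], p_x


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.3, 0.7]]    # illustrative values
    F = [[0.8, 0.2], [0.25, 0.75]]  # illustrative values
    posterior, p_x = update([0.5, 0.5], P, F, x=0)
    print(posterior, p_x)
```

Averaging the updated vectors over the possible outputs, weighted by the output probabilities, recovers the no-observation update, which is a convenient check on an implementation.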


2.4  Policy Evaluation

If there are a number of options available to the
decision-maker for each value of the state of knowledge
vector s(n), then a policy would specify which option (k)
to take for each possible s(n) for all n. The choice of
policies will usually be affected by the total expected
reward associated with each different policy.

The expected reward can be separated into two categories:
immediate and future. The reward to be expected
during the current time unit only is known as immediate
reward. Future reward is the expected reward in the aggregate
of all future time.

This grouping of rewards forms the basis of the dynamic
programming equations to be used throughout the remainder
of this report.

(Total expected reward with n time units remaining)
= (Immediate expected reward in the current unit of time)
+ (Total expected future reward with n-1 time units remaining)

The expected immediate reward will be affected by the
option (k) which the decision-maker exercises at time n and
by his state of knowledge.

$q_k[s(n)]$ = immediate expected reward in the current
unit of time as a function of the state
of knowledge if option k is exercised.


Continuing with the two state process, recall that
at present only two options are allowed. In general terms
those two options are:

k=1 : estimate underlying Markov state 1 as the
current Markov state and act accordingly

k=2 : estimate underlying Markov state 2 as the
current Markov state and act accordingly

For the communications example, these options are:

k=1 : continue transmitting until the next output
is received

k=2 : stop transmitting and check again one time
unit later

The immediate expected earnings in the current unit
of time are:

$$q_k[s(n)] = s_1(n)L_{k1} + s_2(n)L_{k2} = \sum_j s_j(n)\,L_{kj}$$

where $L_{kj}$ = earnings if estimate k and true
state is j
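The immediate expected earnings are a weighted sum of one row of the reward table. The sketch below uses a made-up reward table `L` and hypothetical function names; options are numbered from 1 as in the text, so option k reads row `L[k - 1]`.

```python
def immediate_reward(k, s, L):
    """q_k[s(n)]: sum over true states j of s_j(n) * L_kj."""
    return sum(s_j * L[k - 1][j] for j, s_j in enumerate(s))


def best_immediate_option(s, L):
    """Option number (from 1) with the highest immediate expected reward."""
    options = range(1, len(L) + 1)
    return max(options, key=lambda k: immediate_reward(k, s, L))


if __name__ == "__main__":
    # Illustrative rewards: row k-1 is the option, column j-1 the true state.
    L = [
        [10.0, -5.0],  # continue transmitting: good channel, bad channel
        [0.0, 2.0],    # stop transmitting:     good channel, bad channel
    ]
    s = [0.7, 0.3]
    print(immediate_reward(1, s, L), immediate_reward(2, s, L))
    print(best_immediate_option(s, L))
```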


The total expected earnings are given by the expression
below for a fixed policy.

$F^n[s(n)]$ = total expected earnings with n time units left, as a
function of the current state of knowledge.

$$F^n[s(n)] = \begin{pmatrix}\text{immediate}\\ \text{expected}\\ \text{earnings}\end{pmatrix} + \begin{pmatrix}\text{expected earnings}\\ \text{with time } n-1\\ \text{given } R(n-1)=1\end{pmatrix}\cdot P_1[s(n)] + \begin{pmatrix}\text{expected earnings}\\ \text{with time } n-1\\ \text{given } R(n-1)=2\end{pmatrix}\cdot P_2[s(n)]$$

$$= q_k[s(n)] + P_1[s(n)]\,F^{n-1}[\,T_1[s(n)]\,] + P_2[s(n)]\,F^{n-1}[\,T_2[s(n)]\,]$$

It  is  possible  to  solve  this  functional  equation  for 
the  total  expected  reward  and  thus  estimate  the  effect  of 
a  given  policy  choice. 
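For a short horizon the functional equation for a fixed policy can be solved by direct recursion. The sketch below is a minimal illustration, not the computer program used in this report: the names `update` and `evaluate`, the representation of a policy as a function of (s, n) returning an option number from 1, the default terminal value of zero and all numerical values are assumptions. The recursion branches on every output, so its cost grows exponentially with n.

```python
def update(s, P, F, x):
    """Return (T_x[s], P_x[s]); the posterior is None if P_x[s] is zero."""
    n = len(s)
    predicted = [sum(s[i] * P[i][j] for i in range(n)) for j in range(n)]
    joint = [predicted[j] * F[j][x] for j in range(n)]
    p_x = sum(joint)
    posterior = [v / p_x for v in joint] if p_x > 0.0 else None
    return posterior, p_x


def evaluate(policy, s, n, P, F, L, terminal=lambda s: 0.0):
    """Total expected earnings F^n[s(n)] under a fixed policy."""
    if n == 0:
        return terminal(s)
    k = policy(s, n)  # option number, counted from 1
    total = sum(s[j] * L[k - 1][j] for j in range(len(s)))  # q_k[s(n)]
    for x in range(len(F[0])):
        posterior, p_x = update(s, P, F, x)
        if p_x > 0.0:
            total += p_x * evaluate(policy, posterior, n - 1, P, F, L, terminal)
    return total


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.3, 0.7]]    # illustrative values
    F = [[0.8, 0.2], [0.25, 0.75]]  # illustrative values
    L = [[1.0, 0.0], [0.0, 1.0]]    # reward 1 for a correct estimate

    def always_continue(s, n):
        return 1

    print(evaluate(always_continue, [1.0, 0.0], 2, P, F, L))
```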


2.5  Policy Optimization

To optimize the decision, a criterion of "optimum"
must first be chosen. The rewards earned may be in the
form of money, time, material goods, etc., and the reward
$L_{kj}$ is really a utility or index of usefulness to the
decision-maker. If "optimum" means maximizing the total
expected reward, or total expected utility, then dynamic
programming will yield an optimum policy by solving the
functional equation below subject to the initial conditions
$F^0[s(0)]$ specified by the decision-maker.

$$F^n[s(n)] = \max_k \left\{\, q_k[s(n)] + P_1[s(n)]\,F^{n-1}[\,T_1[s(n)]\,] + P_2[s(n)]\,F^{n-1}[\,T_2[s(n)]\,] \,\right\}$$

where k represents the options available at time n.

Bellman's principle of optimality states that the
computed  solution  will  in  fact  be  the  best  policy  based 
on  the  criterion  of  maximizing  expected  utility. 

An  optimal  policy  has  the  property  that 
whatever  the  initial  state  and  initial  deci¬ 
sion  are,  the  remaining  decisions  must  con¬ 
stitute  an  optimal  policy  with  regard  to  the 
state  resulting  from  the  first  decision. 

In solving the functional equations Bellman's principle
is used in the following manner. Subject to the initial
states and decisions, $F^0[s(0)]$ is specified. Then $F^1[s(1)]$
is found using the functional equation relation and the
values $F^0[s(0)]$. $F^2[s(2)]$ is then found from $F^1[s(1)]$, and
so on. Each time, the decision at time n is made consistent
with decisions already made.
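The recursion just described can be sketched for the two option problem as follows. This is an illustration under stated assumptions rather than the report's own program: the names `update` and `optimal_value`, the zero terminal value $F^0 = 0$, and the numerical values are hypothetical. The future term is computed once per call because, in this formulation, the option chosen does not alter [P] or [F].

```python
def update(s, P, F, x):
    """Return (T_x[s], P_x[s]); the posterior is None if P_x[s] is zero."""
    n = len(s)
    predicted = [sum(s[i] * P[i][j] for i in range(n)) for j in range(n)]
    joint = [predicted[j] * F[j][x] for j in range(n)]
    p_x = sum(joint)
    posterior = [v / p_x for v in joint] if p_x > 0.0 else None
    return posterior, p_x


def optimal_value(s, n, P, F, L):
    """Return (F^n[s(n)], best option numbered from 1), assuming F^0 = 0."""
    if n == 0:
        return 0.0, None
    # Expected future reward, shared by all options in this formulation.
    future = 0.0
    for x in range(len(F[0])):
        posterior, p_x = update(s, P, F, x)
        if p_x > 0.0:
            future += p_x * optimal_value(posterior, n - 1, P, F, L)[0]
    # q_k[s(n)] + future for every option k, then take the maximum.
    candidates = [
        sum(s[j] * L[k][j] for j in range(len(s))) + future
        for k in range(len(L))
    ]
    best = max(range(len(candidates)), key=lambda k: candidates[k])
    return candidates[best], best + 1


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.3, 0.7]]    # illustrative values
    F = [[0.8, 0.2], [0.25, 0.75]]  # illustrative values
    L = [[1.0, 0.0], [0.0, 1.0]]    # reward 1 for a correct estimate
    print(optimal_value([0.9, 0.1], 3, P, F, L))
```

Each call works backward from the terminal condition, so the decision returned for time n is already consistent with the best decisions at all later times.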


2.6  Perfect Information

In many practical situations, perfect or nearly perfect
information about the underlying Markov state can be
obtained at increased expense. With this perfect information
channel at a net cost of an additional "A" dollars,
there is a third option open.

If the perfect information channel is used at time n,
the state of knowledge vector becomes:

[1, 0] with probability $s_1(n)$
[0, 1] with probability $s_2(n)$

Using option 3 at time n, the associated total expected
reward before the channel is used is given below.

$$\begin{aligned} F_3^n[s(n)] = {} & -A + s_1(n)L_{11} + s_2(n)L_{22} \\ & + s_1(n)\,P_1[1,0]\,F^{n-1}[\,T_1[1,0]\,] \\ & + s_1(n)\,P_2[1,0]\,F^{n-1}[\,T_2[1,0]\,] \\ & + s_2(n)\,P_1[0,1]\,F^{n-1}[\,T_1[0,1]\,] \\ & + s_2(n)\,P_2[0,1]\,F^{n-1}[\,T_2[0,1]\,] \end{aligned}$$


Therefore the equations to be solved for the optimum
policy are:

$$F^n[s(n)] = \max \begin{cases} \text{expected reward if option 1 is exercised} \\ \text{expected reward if option 2 is exercised} \\ \text{expected reward if perfect information channel is used} \end{cases}$$

$$= \max \begin{cases} s_1(n)L_{11} + s_2(n)L_{12} + P_1[s(n)]\,F^{n-1}[\,T_1[s(n)]\,] + P_2[s(n)]\,F^{n-1}[\,T_2[s(n)]\,] \\[4pt] s_1(n)L_{21} + s_2(n)L_{22} + P_1[s(n)]\,F^{n-1}[\,T_1[s(n)]\,] + P_2[s(n)]\,F^{n-1}[\,T_2[s(n)]\,] \\[4pt] -A + s_1(n)L_{11} + s_2(n)L_{22} + s_1(n)\,P_1[1,0]\,F^{n-1}[\,T_1[1,0]\,] + s_1(n)\,P_2[1,0]\,F^{n-1}[\,T_2[1,0]\,] \\ \qquad + s_2(n)\,P_1[0,1]\,F^{n-1}[\,T_1[0,1]\,] + s_2(n)\,P_2[0,1]\,F^{n-1}[\,T_2[0,1]\,] \end{cases}$$
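The three way maximization above can be sketched by adding the perfect information option to the backward recursion. This is an illustration only: the names `update`, `expected_future` and `best_value`, the terminal condition $F^0 = 0$, the assumption that option k corresponds to estimating state k (so that a correct estimate earns `L[i][i]`), and all numerical values are hypothetical. The recursion is exponential in n and is meant for small horizons.

```python
def update(s, P, F, x):
    """Return (T_x[s], P_x[s]); the posterior is None if P_x[s] is zero."""
    n = len(s)
    predicted = [sum(s[i] * P[i][j] for i in range(n)) for j in range(n)]
    joint = [predicted[j] * F[j][x] for j in range(n)]
    p_x = sum(joint)
    posterior = [v / p_x for v in joint] if p_x > 0.0 else None
    return posterior, p_x


def expected_future(belief, n, P, F, L, A):
    """Sum over outputs x of P_x[belief] * F^(n-1)[T_x[belief]]."""
    total = 0.0
    for x in range(len(F[0])):
        posterior, p_x = update(belief, P, F, x)
        if p_x > 0.0:
            total += p_x * best_value(posterior, n - 1, P, F, L, A)[0]
    return total


def best_value(s, n, P, F, L, A):
    """Return (F^n[s(n)], best option); the last option number is the
    perfect information channel at cost A. Assumes F^0 = 0."""
    if n == 0:
        return 0.0, None
    states = range(len(s))
    shared = expected_future(s, n, P, F, L, A)
    candidates = [sum(s[j] * L[k][j] for j in states) + shared for k in states]
    # Perfect channel: pay A, learn the state, estimate it correctly,
    # then continue from a state of knowledge with all mass on that state.
    perfect = -A + sum(s[i] * L[i][i] for i in states)
    for i in states:
        if s[i] > 0.0:
            unit = [1.0 if j == i else 0.0 for j in states]
            perfect += s[i] * expected_future(unit, n, P, F, L, A)
    candidates.append(perfect)
    best = max(range(len(candidates)), key=lambda k: candidates[k])
    return candidates[best], best + 1


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.3, 0.7]]    # illustrative values
    F = [[0.8, 0.2], [0.25, 0.75]]  # illustrative values
    L = [[1.0, 0.0], [0.0, 1.0]]    # reward 1 for a correct estimate
    print(best_value([0.5, 0.5], 2, P, F, L, A=0.1))
```

With these made-up numbers the perfect channel is worth buying when the state of knowledge is close to even, and is not worth its cost when the decision-maker is already nearly certain of the state.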

Comparing options 1 and 2, the decision-maker will
choose option 1 over option 2, i.e. estimate the underlying
Markov state as 1 instead of 2 and act accordingly, if:

$$s_1(n)L_{11} + s_2(n)L_{12} > s_1(n)L_{21} + s_2(n)L_{22}$$

since the future terms are identical for the two options.
Therefore, if only options 1 and 2 are available, the
solution is trivial and is made on the basis of the highest
immediate expected return, $q_k[s(n)]$. Adding option 3 has the
effect of allowing him to "invest" A dollars now, in hope
of getting higher overall future returns.

Comparing options 1 and 3, the decision-maker will
estimate 1 instead of using the perfect channel (option 3)
if:

\[
\begin{aligned}
& s_1(n)L_{11} + s_2(n)L_{12}
  + P_1[s(n)]F^{n-1}[T_1[s(n)]] + P_2[s(n)]F^{n-1}[T_2[s(n)]] \\
& \quad \ge\; -A + s_1(n)L_{11} + s_2(n)L_{22} \\
& \qquad + s_1(n)\,P_1[1,0]\,F^{n-1}[T_1[1,0]] + s_1(n)\,P_2[1,0]\,F^{n-1}[T_2[1,0]] \\
& \qquad + s_2(n)\,P_1[0,1]\,F^{n-1}[T_1[0,1]] + s_2(n)\,P_2[0,1]\,F^{n-1}[T_2[0,1]]
\end{aligned}
\]


The equality condition above identifies the value of $s_1(n)$
for which the decision-maker is indifferent between options
1 and 3. If $s_1^{*}(n)$ is defined as that value, the preceding
equation can be solved for $s_1^{*}(n)$, and option 1 will be
chosen over option 3 if $s_1(n) \ge s_1^{*}(n)$.

Similarly, comparing options 2 and 3, option 2 will be
preferred over option 3 if $s_1(n) \le s_1^{**}(n)$.

Summarizing, using the criterion of maximum expected
utility of rewards, the optimal decision will be:

a) estimate state 1 and act accordingly if

\[
s_1(n) > \frac{L_{22}-L_{12}}{L_{11}+L_{22}-L_{12}-L_{21}}
\quad\text{and}\quad s_1(n) \ge s_1^{*}(n)
\]

b) estimate state 2 and act accordingly if

\[
s_1(n) < \frac{L_{22}-L_{12}}{L_{11}+L_{22}-L_{12}-L_{21}}
\quad\text{and}\quad s_1(n) \le s_1^{**}(n)
\]

c) ascertain the true state using the perfect channel if s(n)
doesn't satisfy either a) or b).
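The three-part rule can be written out as a small function. This is a hedged sketch: the function names are hypothetical, the indifference points `s_star` and `s_double_star` are taken as given inputs (in practice they come from solving the option 1 versus 3 and option 2 versus 3 equalities at each n), and the direction of the first inequality assumes L11 + L22 > L12 + L21.

```python
def immediate_threshold(L11, L12, L21, L22):
    """Value of s1(n) at which options 1 and 2 have equal immediate expected reward."""
    denom = L11 + L22 - L12 - L21
    if denom <= 0:
        # Assumption of this sketch: correct estimates pay more than wrong ones.
        raise ValueError("expected L11 + L22 > L12 + L21")
    return (L22 - L12) / denom

def decide(s1, L11, L12, L21, L22, s_star, s_double_star):
    """Return 1, 2 or 3 following rules a), b), c)."""
    threshold = immediate_threshold(L11, L12, L21, L22)
    if s1 > threshold and s1 >= s_star:
        return 1          # a) estimate state 1
    if s1 < threshold and s1 <= s_double_star:
        return 2          # b) estimate state 2
    return 3              # c) use the perfect information channel
```

For example, with the hypothetical rewards L11 = 10, L12 = -5, L21 = 0, L22 = 5 the immediate threshold is 0.5, so a state of knowledge between `s_double_star` and `s_star` falls through to option 3.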


2.7  Additional Partial Information

A further practical generalization can be made by
assuming that the observer has a choice of using a second
imperfect channel which is better than the first, costs
"B" dollars more per usage, but still isn't perfect. The
decision-maker now also has to decide whether the better
channel is worth the extra money for any possible state
of knowledge and time (n).

To answer this question, he adds a fourth option to his
functional equation. The new output matrix corresponding to
the new channel affects the updating of the state of
knowledge if the new channel is used. Therefore, define a
new quantity $T_i'[\cdot]$ to represent the new state of knowledge
after the output reading "i" from the new channel has been
considered.

Under option 4 at time n:

\[
F^{n}[s(n)] = -B + s_1(n)\,X + s_2(n)\,Y
\]

where X is the expected earnings in time n if indication 1 is
received from the new channel, and Y is the expected earnings
in time n if indication 2 is received from the new channel:

\[
\begin{aligned}
X = {} & q[T_1'[s(n)]]
  + P_1[T_1'[s(n)]]\,F^{n-1}[T_1[T_1'[s(n)]]]
  + P_2[T_1'[s(n)]]\,F^{n-1}[T_2[T_1'[s(n)]]] \\
Y = {} & q[T_2'[s(n)]]
  + P_1[T_2'[s(n)]]\,F^{n-1}[T_1[T_2'[s(n)]]]
  + P_2[T_2'[s(n)]]\,F^{n-1}[T_2[T_2'[s(n)]]]
\end{aligned}
\]
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The  value  of  this  technique  Is  that  a  solution  may  be 
found for any length of time, n, remaining. However, this
becomes impractical as n becomes large. Fortunately the
equations will converge on an optimal policy which, for
large n, is independent of n. This "steady state" policy
may become discernible for very small n in some problems.

The next chapter presents a method for obtaining the "steady
state" optimal policy directly.

Consider  a  numerical  example  of  the  two  state  problem. 

A computer program was written to solve the general two
state problem and is included in Appendix I.

The observer has constructed a 2 state model with parameters
given and has the option of using a perfect channel
or estimating the underlying Markov state and acting
accordingly.

The probabilities describing the underlying process
and the output channels are:


The associated rewards are:
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A  =  cost  of  use  of  perfect  Information  output  channel 
= 5 + loss of immediate earnings for one time unit

$F^{0}[s(0)] = 0.0$ for all s(0)

This  could  be  the  previous  communications  satellite 
example  with  the  "perfect”  information  being  obtained  by 
using  one  unit  of  time  to  send  a  special  test  message  to 
the  satellite.  The  test  message  would  be  returned  to  the 
sender  and  from  it  he  could  glean  the  "perfect"  information. 

The computational results are shown in figure 7. The
optimal policy is:

$0 \le s_1(n) < .4$ :  estimate underlying Markov state 2
                       and act accordingly

$.4 \le s_1(n) \le 1.0$ :  estimate underlying Markov state 1
                       and act accordingly

The perfect information channel is never used in the
optimal policy. In relation to the other rewards, the cost
of using the perfect information channel was too high. This
problem is then equivalent to having only options 1 and 2.

There is a growth pattern emerging as time (n) increases.
In figure 7, the curves for $F^{n}[s(n)]$ tend toward a fixed
shape and the separation between $F^{n}[s(n)]$ and $F^{n-1}[s(n-1)]$
appears to be approaching 2.29 as n increases. The optimal
policy is independent of time (n).
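One way to read the separation off successive expected earnings curves is sketched below. The function name `estimate_gain` is hypothetical and both curves are assumed to be tabulated on the same grid of s1 values; if the curves have settled into the form v[s(n)] + nG, the differences are nearly constant and their spread is close to zero.

```python
import numpy as np

def estimate_gain(F_curr, F_prev):
    """Estimate the per-step growth from two successive earnings curves.

    Returns (mean difference, spread of the differences). A small spread
    suggests the curves have reached a fixed shape.
    """
    diff = np.asarray(F_curr, dtype=float) - np.asarray(F_prev, dtype=float)
    return float(diff.mean()), float(diff.max() - diff.min())
```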

In the previous example the cost of perfect information
was too high to warrant usage of the "noiseless" channel.
More illustrative results are obtained if the cost of
perfect information is reduced to:

A = loss of immediate earnings for one time unit.


The results of the calculations are given in figure 8.
The optimal policy is:

n=1:
    $0.0 \le s_1(n) \le 0.4$    estimate state 2 and act accordingly
    $0.4 \le s_1(n) \le 1.0$    estimate state 1 and act accordingly

n=2:
    $0.0 \le s_1(n) \le 0.4$    estimate state 2 and act accordingly
    $0.4 \le s_1(n) \le 1.0$    estimate state 1 and act accordingly

n=3,4,...,9,10:
    $0.0 \le s_1(n) < .38$      estimate state 2 and act accordingly
    $.38 < s_1(n) < .42$        use perfect information source
    $.42 \le s_1(n) \le 1.0$    estimate state 1 and act accordingly

Here again a growth pattern on expected earnings is
becoming discernible as n increases. The separation (gain)
between the $F^{n}[s(n)]$ and $F^{n-1}[s(n-1)]$ curves appears to be
approaching 2.30 as n increases. Notice that when reasonably
priced perfect information became available the growth
of expected earnings as a function of time apparently
increased slightly from 2.29 to 2.30.

[Figure 8: Total Expected Earnings Using Optimal Policy with
Perfect Information Available at a Reasonable Cost]


2.9  Comments

This particular solution technique is useful in determining
optimal policies associated with partially observable
Markov processes for small time (n). The functional equations
which must be solved for the optimal policies are of
the general form given below, where k(n) is the policy choice
at time n.

\[
F^{n}[s(n)] = \max_{k(n)} \Big\{ q^{k(n)}[s(n)]
  + \sum_i P_i^{k(n)}[s(n)]\,F^{n-1}[T_i[s(n)]] \Big\}
\]

While it is theoretically possible to obtain a solution
to the above equation, it may be computationally infeasible.
It is relatively easy and fast to obtain a numerical solution
for the two state partially observable Markov process,
but increasing the system to even five states may be
prohibitive. A good deal of the strength of this technique
depends on the analyst's ability to model the real world
system with a few pertinent states.

This technique also becomes impractical when there is
a large time (n) involved. The next chapter deals with the
question of the existence of a steady state policy for a
partially observable Markovian system and methods of
determining it without iteratively solving the functional
equations introduced previously for ever increasing values
of time.


CHAPTER III

OPTIMAL STEADY STATE POLICY DETERMINATION

In the chapter on optimal time dependent policies, as
time grew large, the expected earnings seemed to converge
on a discernible growth pattern and the policy was apparently
becoming independent of the time (n) for large n. In many
physical situations the time (n) which remains for the real
world system to operate is large and sometimes even unknown.

In those two situations it is not feasible to use the time
dependent solution technique for optimal policy determination.
The question of the existence of a steady state policy and
a method of determining it becomes paramount.

The examples of Chapter II appeared to show the expected
earnings, $F^{n}[s(n)]$, converging on a growth pattern in the
following manner for large time n.

\[
F^{n}[s(n)] = q[s(n)] + \sum_i P_i[s(n)]\,F^{n-1}[T_i[s(n)]]
\]

\[
F^{n}[s(n)] \rightarrow v[s(n)] + nG \qquad \text{(large n)}
\]

where v[s(n)] is interpreted as setting the steady state
shape of the $F^{n}[s(n)]$ curves and G, the gain, is the steady
state growth per unit time.

3.1  The State of Knowledge as a Continuous State Markov Process

To investigate the growth pattern of expected earnings,
it will be necessary to alter the concept of a partially
observable Markov process model. Formerly it was interpreted
as a discrete N state Markov process with stochastic output
channels. Let us now focus on the state of knowledge vector, s(n).
It has N components, $s_i(n)$, which are constrained by:

\[
0 \le s_i(n) \le 1, \qquad \sum_{i} s_i(n) = 1
\]


The state of knowledge vector has N-1 independent
components and may thus be represented as a point in an N-1
dimensional space. The state of knowledge vector for a
partially observable Markov process model is in fact the
state variable for a continuous state Markov process.

Consider a three state underlying process with
$s(n) = [s_1(n), s_2(n), s_3(n)]$. Since $s_3(n) = 1 - s_1(n) - s_2(n)$,
$s_1(n)$ and $s_2(n)$ describe the observer's state of knowledge.

[Figure 9: Continuous State Space]
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With  probability  Pi  [ 3(n)  j ,  it  will  be  transformation 
$T_i[s(n)]$ which describes the new state in terms of the
previous state, s(n). Since the transition probability depends
only on the current state, the Markov assumption is satisfied.
The state of knowledge is thus also the state variable of a
continuous state Markov process.

3.2  Steady State Gain

Let h[s(n)] be the probability density function on
what the observer's state of knowledge will be at time n
far into the future, given some initial state of knowledge.

A completely ergodic Markov process is one whose limiting
probability density function, h[s(n)], for n far into the
future, is independent of the distribution of the starting
state of knowledge.

The continuous state Markov process which has s(n) as
its state variable cannot be considered to be completely
ergodic. Suppose the initial state of knowledge vector for
a two state partially observable process were:


where $c_1$ = a rational number

Note that the initial state of knowledge is precisely
specified such that the initial density consists solely of an
impulse at the point $s_1(n) = c_1$. As time progresses and
the state of knowledge is repeatedly updated, s(n) will
always remain a vector whose components are rational numbers,
by the nature of the transform applied. Therefore, the
limiting probability density function on the state of
knowledge will be nonzero only at a set of points selected from
the rational numbers. Alternately, suppose that the initial
distribution of the state of knowledge is described by a
continuous density function. The limiting density h[s(n)]
for such an initial state of knowledge distribution will
be nonzero at points both inside and outside the set of
rational numbers. Thus, the process is not completely
ergodic.

Karlin [12] investigates the limiting steady state
distribution in similar problems. A limiting density, h[s(n)],
may exist for the class of initial densities which are
continuous over some range of the allowable s(n). Drake
presents a method of computing the limiting density function
for an arbitrary continuous initial density.

I r 

The steady state gain, $G^{k}$, for a given policy, k(n),
could be calculated using the following relation. Another
method of determining the gain, $G^{k}$, is presented in the next
section.

\[
G^{k} = \int h^{k}[s(n)]\,q^{k}[s(n)]\,ds(n)
\]


pattern


for  large  n,  and  that  a  steady  state  optimal  policy  exists. 


The optimal steady state policy could be found by maximizing
the gain G.

\[
\max_{k} G^{k} \;\Rightarrow\; \text{optimal policy, } k_{opt}
\]

A method of policy optimization which is based on gain
maximization was developed by Howard for use on discrete state
Markov systems. His algorithm can be adapted for use in
optimal policy determination on partially observable Markov
process models.

Beginning with the basic equation for expected earnings:

\[
F^{n}[s(n)] = \max_{k} \Big\{ q^{k}[s(n)]
  + \sum_i P_i^{k}[s(n)]\,F^{n-1}[T_i[s(n)]] \Big\}
  \;\rightarrow\; v[s(n)] + nG \quad \text{for n large}
\]

where G = steady state gain and v[s(n)]
can be interpreted as initializing and
setting the shape of the steady state
expected earnings curve $F^{n}[s(n)]$.


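Substituting the steady state form into the basic
equation for a fixed (not necessarily optimal) policy, k:

\[
\begin{aligned}
v^{k}[s(n)] + nG^{k}
 &= q^{k}[s(n)] + \sum_i P_i^{k}[s(n)] \big\{ v^{k}[T_i[s(n)]] + (n-1)G^{k} \big\} \\
 &= q^{k}[s(n)] + (n-1)G^{k} + \sum_i P_i^{k}[s(n)]\,v^{k}[T_i[s(n)]]
\end{aligned}
\]

\[
G^{k} = q^{k}[s(n)] - v^{k}[s(n)] + \sum_i P_i^{k}[s(n)]\,v^{k}[T_i[s(n)]]
\]

where $G^{k}$ = gain using a fixed policy k.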


Solution of the above equation would yield the gain,
G, and the curve v[s(n)] for a fixed policy. To solve the
equation by computer, it is necessary to quantize the
v[s(n)] curve into M points. Then there is a set of M
simultaneous equations and M+1 unknowns. The M+1 unknowns
are the M from v[s(n)] and 1 from G. Consider adding a
quantity, c, to each point of the v[s(n)] curve.


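\[
\begin{aligned}
G &= q[s(n)] - \big[ v[s(n)] + c \big]
     + \sum_i P_i[s(n)] \big[ v[T_i[s(n)]] + c \big] \\
  &= q[s(n)] - v[s(n)] - c
     + \sum_i P_i[s(n)]\,v[T_i[s(n)]] + c \sum_i P_i[s(n)]
\end{aligned}
\]

but $\sum_i P_i[s(n)] = 1$, so

\[
G = q[s(n)] - v[s(n)] + \sum_i P_i[s(n)]\,v[T_i[s(n)]]
\]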


Notice that this is the same equation back again.
This implies that the absolute level of the v[s(n)] curve
cannot be determined from these equations. Therefore, to
allow solution, arbitrarily fix one point on the curve.

V  (  0  j  —  v 


u? 


As will be shown later, only the relative value of the
v[s(n)] curve (i.e. the shape) is necessary. The M equations
and M unknowns may be solved for the gain, G, and
v[s(n)] subject to the above stipulation.

Howard's algorithm contains a policy improvement routine
which rapidly converges on the optimal policy and thus it
is necessary to solve the set of M equations for only a few
policies. Later in this chapter, a method for checking the
"optimal" solution is introduced so that errors introduced
in solving the many simultaneous equations can be detected
and corrected. Also, a technique for avoiding the need to
solve the M simultaneous equations was developed by Schweitzer
and will be introduced.
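The quantized value-determination step can be sketched as a single linear solve. This is an assumption-laden illustration rather than the program of Appendix II: the name `evaluate_policy`, the `branches` data layout, the use of linear interpolation between grid points, and the choice to pin v at the last grid point are all introduced here.

```python
import numpy as np

def evaluate_policy(grid, q, branches):
    """Solve G + v[g] = q[g] + sum_i P_i * v[T_i] on a grid of s1 values.

    grid     : increasing array of M grid points
    q        : immediate earnings at each grid point under the fixed policy
    branches : branches[g] is a list of (probability, next_s1) pairs
    The level of v is fixed by setting v = 0 at the last grid point.
    """
    M = len(grid)
    A = np.zeros((M + 1, M + 1))   # unknowns: v[0..M-1] and G
    b = np.zeros(M + 1)
    for g in range(M):
        A[g, g] += 1.0             # coefficient of v[g]
        A[g, M] = 1.0              # coefficient of the gain G
        b[g] = q[g]
        for prob, nxt in branches[g]:
            # linear interpolation weights for v at the successor point
            hi = min(max(int(np.searchsorted(grid, nxt)), 1), M - 1)
            lo = hi - 1
            w = (nxt - grid[lo]) / (grid[hi] - grid[lo])
            A[g, lo] -= prob * (1.0 - w)
            A[g, hi] -= prob * w
    A[M, M - 1] = 1.0              # the arbitrary fixed point: v[last] = 0
    x = np.linalg.solve(A, b)
    return x[:M], x[M]
```

The extra row supplies the one arbitrarily fixed point, so the system has M+1 equations for M+1 unknowns. A quick sanity check is that constant immediate earnings c should return G = c and a flat v.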


For a fixed policy k, the following is solved for $G^{k}$
and $v^{k}[s(n)]$.

\[
G^{k} = q^{k}[s(n)] - v^{k}[s(n)] + \sum_i P_i^{k}[s(n)]\,v^{k}[T_i[s(n)]]
\]


The next step is to use the $G^{k}$ and $v^{k}[s(n)]$ obtained
to find a better policy. Recall that the optimum policy
maximized:

\[
F^{n}[s(n)] = \max_{k(n)} \Big\{ q^{k(n)}[s(n)]
  + \sum_i P_i^{k(n)}[s(n)]\,F^{n-1}[T_i[s(n)]] \Big\}
\]
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In  the  steady  state: 


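\[
\begin{aligned}
v_{opt}[s(n)] + nG_{opt}
 &= \max_{k} \Big\{ q^{k}[s(n)]
    + \sum_i P_i^{k}[s(n)] \big[ v_{opt}[T_i[s(n)]] + (n-1)G_{opt} \big] \Big\} \\
 &= \max_{k} \Big\{ q^{k}[s(n)] + (n-1)G_{opt}
    + \sum_i P_i^{k}[s(n)]\,v_{opt}[T_i[s(n)]] \Big\}
\end{aligned}
\]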


Since $G_{opt}$ here is the gain associated with the optimal
policy, it is not a function of k, and an equivalent test
quantity to be maximized as a function of policy (k) is:

\[
TEST^{k}[s(n)] = q^{k}[s(n)] + \sum_i P_i^{k}[s(n)]\,v_{opt}[T_i[s(n)]]
\]

Because $\sum_i P_i^{k}[s(n)] = 1$, any additive constant in
$v_{opt}[T_i[s(n)]]$ shifts the test quantity by the same amount
for every k and so does not affect the maximization; therefore
only relative values of $v_{opt}[T_i[s(n)]]$ are needed.

Howard's algorithm says that maximizing the test
quantity using v[s(n)] from some arbitrary policy (not
necessarily optimal) will always yield a new policy which
is at least as good as the old arbitrary one and that an
optimal policy can be found in the manner illustrated in
figure 10.
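The policy improvement step amounts to maximizing the test quantity at every quantized state of knowledge. The sketch below is illustrative only; the name `improve_policy` and the layout of `options` are assumptions, and v is interpolated linearly between grid points.

```python
import numpy as np

def improve_policy(grid, v, options):
    """Pick, at each grid point, the option with the largest test quantity.

    options[k] = (q_k, branches_k), where q_k[g] is the immediate earnings
    of option k at grid point g and branches_k[g] is a list of
    (probability, next_s1) pairs. Returns an array of option indices.
    """
    policy = np.empty(len(grid), dtype=int)
    for g in range(len(grid)):
        tests = [q[g] + sum(pr * np.interp(nxt, grid, v) for pr, nxt in br[g])
                 for q, br in options]
        policy[g] = int(np.argmax(tests))   # ties go to the lower-numbered option
    return policy
```

Because the branch probabilities of each option sum to one, adding a constant to every entry of `v` leaves the selected policy unchanged, which is why only relative values are needed.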
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By  refocusing  attention  from  the  interpretation  of  a 
partially observable Markov process as an underlying discrete
Markov process plus output channels, it is seen that the
state of knowledge vector is the state variable of a
continuous state Markov process. The optimal policy can be
found utilizing a digital computer and Howard's algorithm
for discrete Markov processes. How close the digital solution
is to the continuous state solution remains to be seen.

Experience with discrete state problems has shown
Howard's algorithm to be computationally efficient and that
the sequence of policies generated iteratively will usually
converge in a small number of cycles. The convergence may
be hastened by selecting the arbitrary initial policy to be
as close to optimal as possible. The decision-maker could
incorporate all of his prior feelings into the initial
policy although it is not necessary to do so.

3.4  Proof  of  Policy  Convergence  and  Optimization 

A proof, adapted from Howard's work, is now offered
showing that the policy which is converged upon is in fact
the one with the highest gain of all possible policies.

Suppose that an initial policy A has been operated
upon and the policy improvement routine has produced a
policy B which is different from A.

Prove: $G^{B} \ge G^{A}$

Since B was chosen over A:

\[
TEST^{B}[s(n)] \ge TEST^{A}[s(n)] \quad \text{for all } s(n)
\]

\[
q^{B}[s(n)] + \sum_i P_i^{B}[s(n)]\,v^{A}[T_i[s(n)]]
\;\ge\;
q^{A}[s(n)] + \sum_i P_i^{A}[s(n)]\,v^{A}[T_i[s(n)]]
\]

Let

\[
\begin{aligned}
\gamma^{B}[s(n)] &= TEST^{B}[s(n)] - TEST^{A}[s(n)] \\
 &= q^{B}[s(n)] - q^{A}[s(n)]
   + \sum_i P_i^{B}[s(n)]\,v^{A}[T_i[s(n)]]
   - \sum_i P_i^{A}[s(n)]\,v^{A}[T_i[s(n)]]
\end{aligned}
\]

\[
\gamma^{B}[s(n)] \ge 0 \quad \text{for all } s(n)
\]

where $\gamma^{B}[s(n)]$ is the improvement in the
test quantity that the policy improvement
routine was able to make.

The expressions for $G^{B}$ and $G^{A}$ are:

\[
G^{B} = q^{B}[s(n)] - v^{B}[s(n)] + \sum_i P_i^{B}[s(n)]\,v^{B}[T_i[s(n)]]
\]

\[
G^{A} = q^{A}[s(n)] - v^{A}[s(n)] + \sum_i P_i^{A}[s(n)]\,v^{A}[T_i[s(n)]]
\]

Subtracting $G^{A}$ from $G^{B}$ and rearranging:

\[
G^{B} - G^{A} + v^{B}[s(n)] - v^{A}[s(n)]
 = q^{B}[s(n)] - q^{A}[s(n)]
 + \sum_i P_i^{B}[s(n)]\,v^{B}[T_i[s(n)]]
 - \sum_i P_i^{A}[s(n)]\,v^{A}[T_i[s(n)]]
\]

Introducing $\gamma^{B}[s(n)]$:

\[
\begin{aligned}
G^{B} - G^{A} + v^{B}[s(n)] - v^{A}[s(n)]
 &= \gamma^{B}[s(n)]
   - \sum_i P_i^{B}[s(n)]\,v^{A}[T_i[s(n)]]
   + \sum_i P_i^{A}[s(n)]\,v^{A}[T_i[s(n)]] \\
 &\quad + \sum_i P_i^{B}[s(n)]\,v^{B}[T_i[s(n)]]
   - \sum_i P_i^{A}[s(n)]\,v^{A}[T_i[s(n)]] \\
 &= \gamma^{B}[s(n)]
   + \sum_i P_i^{B}[s(n)] \big[ v^{B}[T_i[s(n)]] - v^{A}[T_i[s(n)]] \big]
\end{aligned}
\]

Define:

\[
\Delta G = G^{B} - G^{A}, \qquad
\Delta v[s(n)] = v^{B}[s(n)] - v^{A}[s(n)]
\]

Substituting the definitions into the previous equation:

\[
\Delta G + \Delta v[s(n)] = \gamma^{B}[s(n)] + \sum_i P_i^{B}[s(n)]\,\Delta v[T_i[s(n)]]
\]

\[
\text{1)} \quad \Delta G = \gamma^{B}[s(n)] - \Delta v[s(n)] + \sum_i P_i^{B}[s(n)]\,\Delta v[T_i[s(n)]]
\]

Note that the above is identical in form to:

\[
\text{2)} \quad G = q[s(n)] - v[s(n)] + \sum_i P_i[s(n)]\,v[T_i[s(n)]]
\]

Recall that the steady state probability density, h[s(n)],
was related to G in the following manner:

\[
\text{3)} \quad G = \int h[s(n)]\,q[s(n)]\,ds(n)
\]

So the solution for $\Delta G$ in equation 1 is:

\[
\text{4)} \quad \Delta G = \int h^{B}[s(n)]\,\gamma^{B}[s(n)]\,ds(n)
\]

Since:

\[
\gamma^{B}[s(n)] \ge 0 \quad \text{for all } s(n)
\]

and a property of all probability density functions is:

\[
h^{B}[s(n)] \ge 0 \quad \text{for all } s(n)
\]

Therefore:

\[
\Delta G \ge 0
\]
A new policy obtained using the algorithm has at least
as high a steady state gain as the old policy. Furthermore,
it is impossible for a policy with higher steady state gain
to exist and not be discovered ultimately by the iterative
routine.

Assume for two policies X and Y that $G^{Y} > G^{X}$ but that
the iteration routine has converged on X.

Since X was chosen over Y in the policy improvement
routine and the policy X set the test policy just prior to
convergence:

\[
TEST^{X}[s(n)] \ge TEST^{Y}[s(n)]
\]

\[
q^{X}[s(n)] + \sum_i P_i^{X}[s(n)]\,v^{X}[T_i[s(n)]]
\;\ge\;
q^{Y}[s(n)] + \sum_i P_i^{Y}[s(n)]\,v^{X}[T_i[s(n)]]
\]

\[
\gamma[s(n)] = TEST^{X}[s(n)] - TEST^{Y}[s(n)] \ge 0
\]

By the method of the previous proof:

\[
\Delta G = G^{X} - G^{Y} \ge 0
\]

Since the initial assumption was that $G^{Y} > G^{X}$, this
is a direct contradiction and hence it is impossible for the
algorithm to ultimately converge on a policy which has less
than optimal gain.


3.5  Reinterpretation of Relative Values

It is not immediately obvious that the relative values,
v[s(n)], which are obtained for one fixed policy, should be
useful in policy improvement. To obtain some insight into
this matter and into the general concept of steady state
policy determination, consider a "policy space" which consists
of all conceivable policies. For any policy, k, the
state of knowledge vector, s(n), can undergo certain
transformations, $T_i[s(n)]$, to obtain a new "state of knowledge"
vector. The variables that are a function of policy k are
the immediate earnings, $q^{k}[s(n)]$, and the probabilities
that specific transformations are applied, $P_i^{k}[s(n)]$. If
we have M possible transformations, there are M independent
functions in general that are set by the policy. There
are M-1 independent $P_i^{k}[s(n)]$ and one $q^{k}[s(n)]$. A policy
fixes the decision to be made for each s(n). Consider one
specific but arbitrary s(n). For this value of s(n), the
M independent parameters associated with the decision would
specify a point in a Euclidean M-space. Thus, every conceivable
decision could be represented by a point in this
"decision space". Only certain of those points would represent
allowable decisions.

Consider two decisions, A and B, with points $D_A$ and $D_B$
in the decision space for a particular s(n). Now consider
all possible decisions lying on the line segment joining $D_A$
and $D_B$ in the decision space. Pick a new decision on that
line segment and define it to be a "randomization" of A
and B. If r is the randomization parameter and AB is the
randomized decision, the new variables are related to the
A and B variables thusly:

\[
P_i^{AB}[r, s(n)] = r\,P_i^{B}[s(n)] + (1-r)\,P_i^{A}[s(n)]
\]

\[
q^{AB}[r, s(n)] = r\,q^{B}[s(n)] + (1-r)\,q^{A}[s(n)]
\]

The gains of policies A and B are:

\[
G^{A} = \int h^{A}[s(n)]\,q^{A}[s(n)]\,ds(n), \qquad
G^{B} = \int h^{B}[s(n)]\,q^{B}[s(n)]\,ds(n)
\]

The gain of the randomized policy is:

\[
G^{AB}(r) = \int h^{AB}[r, s(n)]\,q^{AB}[r, s(n)]\,ds(n)
\]

where

\[
h^{AB}[r, s(n)] = r\,h^{B}[s(n)] + (1-r)\,h^{A}[s(n)]
\]

Relating this back to policy improvement, consider
policy A as the initial policy and policy AB as the policy
being tested.

\[
\begin{aligned}
\gamma^{AB}[r, s(n)] &= \text{improvement in test quantity} \\
 &= q^{AB}[r, s(n)] - q^{A}[s(n)]
   + \sum_i P_i^{AB}[r, s(n)]\,v^{A}[T_i[s(n)]]
   - \sum_i P_i^{A}[s(n)]\,v^{A}[T_i[s(n)]] \\
 &= r\,\gamma^{B}[s(n)]
\end{aligned}
\]

Since $\gamma^{B}[s(n)]$ is the improvement in the test quantity
of the policy improvement routine if decision B is substituted
for original decision A, and since $h^{A}[s(n)] \ge 0$ for
all s(n), finding the decision B which gives maximum
test quantity improvement is equivalent to finding:

\[
\max_{B} \; \frac{dG^{AB}(r)}{dr} \quad \text{for } r = 0
\]


For a given policy A, the algorithm computes (for
each B) the directional derivative, evaluated at A, of
gain from A to B in decision space. It then selects as the
new policy the one with the highest directional derivative
of gain. If A is the optimal policy then all the directional
derivatives evaluated at A will be less than or equal to zero.
With this interpretation of what the policy improvement
routine does, it becomes clearer why the relative values,
v[s(n)], at the old policy A are useful. They determine
$\gamma^{B}[s(n)]$, which in turn is closely related to the
directional derivative of gain evaluated at A.

3.6  Example 

Using the technique just developed for steady state
solution for the optimal policy, consider the numerical
example introduced in Chapter II.


\[
[P] = \begin{bmatrix} .9 & .1 \\ .4 & .6 \end{bmatrix}
\qquad
[R] = \begin{bmatrix} .9 & .1 \\ .3 & .7 \end{bmatrix}
\]

>  4 

i  i 

A = cost of use of the perfect information channel.

A = 5 + loss of immediate earnings for one time unit.
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F°f  s (  0)  ]  =  C  for  all  s(  0) 

A computer program has been written to find the steady
state gain, relative values, and optimal policy for a two
state partially observable Markov process and is included
in Appendix II.

In this example the decision-maker has three options
open to him.

option 1: continue transmitting until the next output
          is received

option 2: stop transmitting and check again one time
          unit later

option 3: use perfect information channel

Recall that in Chapter II it was found that the cost
of perfect information was too high to warrant use of the
"noiseless" channel. This was found also to be true for
the steady state policy here. The results are given in
Table I, and Figure 11 provides a graphical comparison of
the steady state relative values and the $F^{10}$ expected
earnings curve found in Chapter II. They are almost identical
in shape. By looking at the transient expected
earnings, the gain was predicted to be about 2.29 and the
steady state gain was found to be 2.2989? using the computer
program of Appendix II and the techniques introduced in this
chapter. With the cost of perfect information reduced to a
reasonable level the example was reworked.
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TABLE  I 


EXAMPLE RESULTS

With no perfect information channel available the gain
in steady state was:

G = 2.2989?

The relative values and decisions are:

s1(n)     v(s(n))      Decision
0.00     +0.21560      2
0.02     -0.0147?      2
0.04     -0.2226?      2
0.06     -0.39100      2
0.08     -0.39865      2
0.10     -0.77769      2
0.12     -0.98307      2
0.14     -1.77914      2
0.16     -1.41860      2
0.18     -1.62406      2
0.20     -1.72908      2
0.22     -1.93353      2
0.24     -2.14215      2
0.26     -2.36346      2
0.28     -2.58545      2
0.30     -2.74256      2
0.32     -2.99491      2
0.34     -3.19564      2
0.36     -3.39637      2
0.38     -3.62377      2
0.40     -3.82286      1 or 2
0.42     -3.72195      1
0.44     -3.43766      1
0.46     -3.36893      1
0.48     -3.26575      1
0.50     -3.16257      1
0.52     -3.0?341      1
0.54     -2.9161?      1
0.56     -2.97431      1
0.58     -2.?9816      1
0.60     -2.04197      1
0.62     -7.53330      1
0.64     -2.424?3      1
0.66     -2.74044      1
0.68     -2.11952      1
0.70     -7.0576?      1
0.72     -2.00757      1
0.74     -1.89419      1
0.76     -1.72660      1
0.78     -1.01142      1
0.80     -1.54021      1
0.82     -1.47730      1
0.84     -1.19940      1
0.86     -1.05639      1
0.88     -0.9363?      1
0.90     -0.76434      1
0.92     -0.03339      1
0.94     -0.35545      1
0.96     -0.2496?      1
0.98     -0.09156      1
1.00     -0.00000      1
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TAfiLb  II 


EXAMPLE RESULTS

With perfect information available the gain in steady
state was:

G = 2.2995?

The relative values and decisions are:

s1(n)     v(s(n))      Decision
0.00     +0.16609      2
0.02     -0.04244      2
0.04     -0.?5055      2
0.06     -0.4166?      2
0.08     -0.62634      2
0.10     -0.80534      2
0.12     -1.012?1      2
0.14     -1.3?6??      2
0.16     -1.44621      2
0.18     -1.65166      2
0.20     -1.75667      2
0.22     -1.96112      2
0.24     -2.27965      2
0.26     -2.4110?      2
0.28     -2.61301      2
0.30     -2.77013      2
0.32     -3.02245      2
0.34     -3.22317      2
0.36     -3.42389      2
0.38     -3.65127      2
0.40     -3.81191      3
0.42     -3.74944      1
0.44     -3.4597?      1
0.46     -3.39098      1
0.48     -3.287?4      1
0.50     -3.19451      1
0.52     -5.050?4      1
0.54     -?.94?54      1
0.56     -?.93334      1
0.58     -?.7404?      1
0.60     -2.6693?      1
0.62     -2.56009      1
0.64     -2.45???      1
0.66     -?.?0?06      1
0.68     -?.15612      1
0.70     -2.08446      1
0.72     -?.03414      1
0.74     -1.9?074      1
0.76     -1.75335      1
0.78     -1.637??      1
0.80     -1.56673      1
0.82     -1.49880      1
0.84     -1.21792      1
0.86     -1.?????      1
0.88     -0.?????      1
0.90     -0.78979      1
0.92     -0.65884      1
0.94     -0.41093      1
0.96     -0.27510      1
0.98     -0.11798      1
1.00     -0.00000      1


A = cost of use of perfect information channel
  = loss of immediate earnings for one time unit.

The steady state gain, relative values, and policy are
given in Table II, and Figure 1[?] provides a graphical
comparison of the steady state relative values and F[s(10)],
the expected earnings found when the same example was worked
in Chapter II. Again the curves are nearly identical in
shape, and the steady state gain of approximately 2.2995 is
near the 2.30 previously predicted. The results obtained by
the two different techniques support each other. Notice that
the availability of perfect information increased the gain.

If finer precision is desired in the range of s(n)
where a change of "best option" takes place, a finer grid
could be used in that region. That is, in breaking the
continuous s(n) vector into discrete points, make more
divisions in regions of particular interest.
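
The idea of a locally refined grid can be sketched as follows. This is
only an illustration in Python and is not part of the original programs.
The function name refined_grid, the fine spacing of .005, and the
interval from .38 to .44 are all assumptions, chosen here to bracket the
change of decision that Table II shows near s1(n) = .40.

```python
def refined_grid(coarse_step=0.02, fine_step=0.005, lo=0.38, hi=0.44):
    """Grid on [0, 1]: spacing fine_step inside [lo, hi], coarse_step elsewhere.

    All default values are illustrative assumptions, not values from the report.
    """
    points = set()
    # Coarse grid over the whole interval (51 points for a step of .02).
    n_coarse = round(1.0 / coarse_step)
    for i in range(n_coarse + 1):
        points.add(round(i * coarse_step, 10))
    # Extra points in the region of particular interest.
    n_fine = round((hi - lo) / fine_step)
    for i in range(n_fine + 1):
        points.add(round(lo + i * fine_step, 10))
    # The set removes points shared by both grids; sort for later interpolation.
    return sorted(points)


grid = refined_grid()
```

Any interpolation over such a grid must use the actual distance between
neighboring points rather than a fixed step of .02.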


3.2  Verifying the Numerically Determined Policy

Because the computer solution involves a discrete
approximation to the continuous s(n) vector, the numerically
produced "optimal" solution could vary from the true
optimal solution. To check on the accuracy of the numerical
solution, one might vary the number of discrete points used
to approximate the continuous vector and note the effect
this has on the solution. A much better technique is to
test the steady state solution in question by use of the
time-dependent techniques of Chapter II. Recall that the
observer specifies some initial expected earnings,
$F^{0}[s(0)]$, and the time-dependent techniques of Chapter II
compute $F^{1}[s(1)]$ and so forth iteratively. The steady
state relative values were interpreted as specifying the
shape of the $F^{n}[s(n)]$ expected earnings curve for n large.
Therefore if the true steady state relative values, v[s(n)],
were used as the initial expected earnings $F^{0}[s(0)]$, then
$F^{1}[s(1)]$ should be simply $F^{0}[s(0)]$ plus the steady
state gain:

$$F^{1}[s(1)] = F^{0}[s(0)] + G = v[s(n)] + G$$

This provides a check on the numerical steady state
policy which may be in question.
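
The check can be sketched in Python as below. This is a minimal
illustration and not the program used for Figure 13. For brevity it
uses a small, fully observable two-state problem in place of the 51
grid points of s1(n); the transition rows, the rewards, and the names
dp_step and check_relative_values are assumptions made for the example.
One step of the time-dependent recursion is applied to the candidate
relative values, and the increase is tested for being the same at every
point.

```python
def dp_step(F_prev, alternatives):
    """One step of the recursion: best of (immediate reward + expected F_prev)."""
    F_next = []
    for state_alternatives in alternatives:
        best = max(
            q + sum(p * f for p, f in zip(row, F_prev))
            for row, q in state_alternatives
        )
        F_next.append(best)
    return F_next


def check_relative_values(v, alternatives, tol=1e-9):
    """Use v as F0, compute F1, and test whether F1 - F0 is one constant G."""
    F1 = dp_step(v, alternatives)
    diffs = [f1 - f0 for f1, f0 in zip(F1, v)]
    is_steady = max(diffs) - min(diffs) <= tol
    return is_steady, diffs


# Illustrative data (assumed): each state has two alternatives, each given
# as (transition probabilities, expected immediate reward).
alternatives = [
    [([0.5, 0.5], 6.0), ([0.8, 0.2], 4.0)],    # state 1
    [([0.4, 0.6], -3.0), ([0.7, 0.3], -5.0)],  # state 2
]

# Candidate relative values with the last one fixed at zero.
ok, diffs = check_relative_values([10.0, 0.0], alternatives)
```

For these numbers every entry of diffs equals the gain, 2. If the
candidate values are wrong, the entries of diffs differ from point to
point and the size of the spread indicates how far the numerical
solution is from a true steady state solution.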

Figure 13 shows the result of such a check, which was
run on the example that has been used throughout this
report.

3.3  Computational Considerations

Schweitzer¹⁰ developed an improvement on Howard's
algorithm for discrete state Markov processes which makes
it computationally more practical for problems with a large
number of states. Normally the policy improvement portion
of the routine is used on all of the states before solving
the K simultaneous equations of the value determination
portion of the cycle. Schweitzer noted that if the policy
improvement was done on only one state, the solution to
the new set of simultaneous equations is quite simply
related to the solution of the old set of simultaneous
equations. By judicious choice of the arbitrary initial policy,
the initial solution to the simultaneous equations is trivial.
Effectively, the set of equations need never be solved and
a major deterrent to the use of the algorithm has been removed.
Schweitzer estimates that with his modification, Howard's
algorithm would be able to handle on the order of five
thousand discrete states.

Thus, dynamic programming techniques have been shown
to be of use in the determination of optimal steady state
policies associated with partially observable Markov processes.
CHAPTER IV


CONCLUDING REMARKS

The partially observable Markov process has been
presented and some of its properties discussed. The primary
area of investigation in this report was the selection of
a course of action from a set of alternatives using only
the information about the system which is available from
the observable outputs.

Dynamic programming techniques were shown to be of
use in the optimization of both transient and steady state
policies. While theoretically the optimization can always
be done, there is a definite computational limitation which
was discussed.

There are several extensions of this investigation
that could be made. The concept of discounting of future
rewards could be considered. The optimum placement of
investment dollars to improve prediction abilities and average
earnings could be investigated. In addition, time-variant
system parameters could be introduced.

The  optimization  technique  used  deals  with  a  problem 
of  much  higher  dimensionality  than  that  of  the  original 
underlying  process.  A  method  is  needed  which  will  allow 
solution  of  sequential  decision  problems  beyond  the  scope 
of  those  that  can  be  handled  by  the  technique  presented  in 
this  report. 
APPENDIX I


COMPUTER SOLUTION FOR THE OPTIMAL
TIME DEPENDENT POLICY

In Chapter II, equations for a two state process are
formulated. They are solved by this program for the case
where the decision-maker has available the options of:

1)  estimate state 1 and act accordingly.

2)  estimate state 2 and act accordingly.

3)  use perfect information channel at added cost.

For the two state process the state of knowledge vector
$[s_1(n), s_2(n)]$ is specified by $s_1(n)$ since $s_2(n) = 1 - s_1(n)$.

The program breaks $s_1(n)$ (which can take any value from 0.0
to 1.0) into 51 points and calculates the maximum expected
utility of rewards, $F^{n}(s(n))$, for each of the 51 points.

The initial values $F^{0}(s(0))$ must be specified and then the
program calculates $F^{1}(s(1))$ by finding the maximum of the
three possible options' rewards. $F^{2}(s(2))$ is then found
from $F^{1}(s(1))$. This continues on up to $n = n_{max}$, which is
specified by the person using the program. The output is
in the form of expected total utility of rewards at each of
the 51 points for each n from 1 to $n_{max}$, plus the decision
D(n,k) to be made at each point k for each time n. This
decision is optimal based on the criterion of maximizing the
expected utility of rewards.
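
The recursion described above can be sketched in a few lines of Python.
The sketch is not a transcription of program "VALUE". It is a minimal
illustration under stated assumptions: P is the transition matrix of
the underlying process, O (a hypothetical name) gives the probability
of each output in each state, R[a][i] is the reward for option a+1 when
the true state is i+1, and the perfect information option is assumed to
cost A, to forgo the immediate earnings, and to reveal the next state
exactly. The initial values are taken as zero and all numbers are
invented for the example.

```python
import bisect


def interpolate(grid, values, x):
    """Piecewise-linear interpolation of values given on an increasing grid."""
    if x <= grid[0]:
        return values[0]
    if x >= grid[-1]:
        return values[-1]
    hi = bisect.bisect_right(grid, x)
    lo = hi - 1
    w = (x - grid[lo]) / (grid[hi] - grid[lo])
    return (1.0 - w) * values[lo] + w * values[hi]


def value_iteration(P, O, R, A, n_max, n_points=51):
    """Return the grid, F at n = n_max, and the decisions D[n-1][k] (1, 2 or 3)."""
    grid = [k / (n_points - 1) for k in range(n_points)]
    F = [0.0] * n_points          # initial expected earnings (assumed zero)
    decisions = []
    for n in range(1, n_max + 1):
        F_new, D = [], []
        for x in grid:
            # Probability of state 1 after one transition.
            pi1 = x * P[0][0] + (1.0 - x) * P[1][0]
            # Expected future earnings after observing an imperfect output.
            future = 0.0
            for r in range(2):
                p_r = pi1 * O[0][r] + (1.0 - pi1) * O[1][r]
                if p_r > 0.0:
                    posterior = pi1 * O[0][r] / p_r
                    future += p_r * interpolate(grid, F, posterior)
            options = [
                x * R[0][0] + (1.0 - x) * R[0][1] + future,   # estimate state 1
                x * R[1][0] + (1.0 - x) * R[1][1] + future,   # estimate state 2
                -A + pi1 * F[-1] + (1.0 - pi1) * F[0],        # perfect information
            ]
            best = options.index(max(options))
            F_new.append(options[best])
            D.append(best + 1)
        F = F_new
        decisions.append(D)
    return grid, F, decisions


# Invented example data.
P = [[0.9, 0.1], [0.1, 0.9]]
O = [[0.8, 0.2], [0.2, 0.8]]
R = [[1.0, -1.0], [-1.0, 1.0]]
grid, F, decisions = value_iteration(P, O, R, A=0.5, n_max=5)
```

Here decisions[n-1][k] plays the role of D(n,k), with k counted from
zero. Linear interpolation between grid points is used because the
updated state of knowledge rarely falls exactly on one of the 51 points.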
Figure 14  Flow Graph - "Value"


PROGRAM LISTING - "VALUE"


      DIMENSION E(11,51), P(2,2), F(2,2), R(2,2), X(51)
      DIMENSION PONE(51),PTWO(51),TONE(51),TTWO(51),LONE(51),LTWO(51)
      COMMON E, F, P, R, X, PONE, PTWO, TONE, TTWO, LONE, LTWO
C     FORMAT 2 IS ILLEGIBLE IN THE SOURCE.  4F10.5 IS ASSUMED.
    2 FORMAT (4F10.5)
    3 FORMAT (5F10.5)
      READ 2, P(1,1), P(1,2), P(2,1), P(2,2)
      READ 2, F(1,1), F(1,2), F(2,1), F(2,2)
      READ 2, R(1,1), R(1,2), R(2,1), R(2,2)
C     A IS THE COST OF THE PERFECT INFORMATION CHANNEL
      A = 10.0
C     INITIAL EXPECTED EARNINGS
      DO 4 I = 1,51
    4 E(1,I) = 0.0
C     GRID OF 51 POINTS ON S1(N)
      X(1) = 0.0
      DO 6 L = 1,50
      M = L + 1
    6 X(M) = X(L) + .02
C     OUTPUT PROBABILITIES AND UPDATED STATES OF KNOWLEDGE
      DO 7 K = 1,51
      PONE(K) = X(K)*(P(1,1)*F(1,1)+P(1,2)*F(2,1))
      PONE(K) = PONE(K)+(1.0-X(K))*(P(2,2)*F(2,1)+P(2,1)*F(1,1))
      PTWO(K) = 1.0 - PONE(K)
      TONE(K) = (X(K)*P(1,1)*F(1,1)+(1.0-X(K))*P(2,1)*F(1,1))/PONE(K)
      TTWO(K) = (X(K)*P(1,1)*F(1,2)+(1.0-X(K))*P(2,1)*F(1,2))/PTWO(K)
C     LOCATE THE GRID POINT AT OR BELOW EACH UPDATED VALUE
      J = 51
   10 IF (TONE(K) - X(J)) 8,9,9
    8 J = J - 1
      IF (J) 11,10,10
    9 LONE(K) = J
   11 J = 51
   19 IF (TTWO(K) - X(J)) 13,12,12
   13 J = J - 1
      IF (J) 7,19,19
   12 LTWO(K) = J
    7 CONTINUE
      PRINT 103, (PONE(J), J = 1,50)
      PRINT 103, (PTWO(J), J = 1,50)
      PRINT 103, (TONE(J), J = 1,50)
      PRINT 103, (TTWO(J), J = 1,50)
C     ITERATE ON THE LEVEL N
      DO 100 N = 2,11
      M = N - 1
C     OPTION 3, USE OF THE PERFECT INFORMATION CHANNEL
      I = LONE(51)
      J = I + 1
      SLOPE = (E(M,J)-E(M,I))/.02
      TEST3 = (TONE(51)-X(I))*SLOPE+E(M,I)
      I = LTWO(51)
      J = I + 1
      SLOPE = (E(M,J)-E(M,I))/.02
      TEST4 = (TTWO(51)-X(I))*SLOPE+E(M,I)
      TEST3 = (-A)+PONE(51)*TEST3+PTWO(51)*TEST4
      DO 99 K = 1,51
      M = N - 1
C     LINEAR INTERPOLATION OF THE EARNINGS AT LEVEL M
      I = LONE(K)
      J = I + 1
      SLOPE = (E(M,J)-E(M,I))/.02
      PAST1 = (TONE(K)-X(I))*SLOPE+E(M,I)
      I = LTWO(K)
      J = I + 1
      SLOPE = (E(M,J)-E(M,I))/.02
      PAST2 = (TTWO(K)-X(I))*SLOPE+E(M,I)
      PAST = PONE(K)*PAST1 + PTWO(K)*PAST2
C     OPTIONS 1 AND 2, THEN KEEP THE LARGEST OF THE THREE
      TEST1 = X(K)*R(1,1)-(1.0-X(K))*R(1,2)+PAST
      TEST2 = (-X(K))*R(2,1)+(1.0-X(K))*R(2,2)+PAST
      IF (TEST1-TEST2) 61,62,62
   61 IF (TEST2 - TEST3) 63,64,64
   62 IF (TEST1 - TEST3) 63,65,65
   63 E(N,K) = TEST3
      GO TO 99
   64 E(N,K) = TEST2
      GO TO 99
   65 E(N,K) = TEST1
   99 CONTINUE
  100 CONTINUE
      DO 101 I = 1,11
      L = I - 1
      PRINT 5, L
    5 FORMAT (27H EXPECTED EARNINGS AT LEVEL,2X,I2)
  101 PRINT 103, (E(I,J),J = 1,50)
  103 FORMAT (10F10.5)
      CALL EXIT
      END
APPENDIX II


COMPUTER SOLUTION FOR THE OPTIMAL STEADY STATE POLICY

In Chapter III an algorithm is presented which allows
determination of certain steady state policies associated
with partially observable Markov processes. That method is
used in this program on the general two state process where
the decision-maker has available the options of:

1)  estimate underlying Markov state 1 as the current
Markov state and act accordingly.

2)  estimate underlying Markov state 2 as the current
Markov state and act accordingly.

3)  use perfect information channel at added cost.

For the two state process the state of knowledge
vector, s(n), is fully specified by $s_1(n)$ since $s_2(n) = 1 - s_1(n)$.

The program breaks the continuous $s_1(n)$ into 51 points
and determines the optimal policy and associated relative
values and gain. The decision is optimal based on the
criterion of maximizing expected utility of rewards. Figure 10
of Chapter III very adequately serves as a flow graph of this
program.
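
The cycle of value determination and policy improvement can be sketched
in Python as below. This is an illustration of the general method on a
small discrete problem and is not a transcription of program "POLICY".
The names solve and policy_iteration, the two-state example data, and
the counting of decisions from zero are assumptions made for the
sketch. As in the program, the last relative value is fixed at zero so
that the gain g becomes the final unknown of the simultaneous equations.

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting; returns x with A x = b."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("no solution for values")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def policy_iteration(alternatives, policy=None, max_iter=100):
    """alternatives[i] lists (transition row, immediate reward) for state i."""
    n = len(alternatives)
    policy = list(policy) if policy is not None else [0] * n
    for _ in range(max_iter):
        # Value determination: v_i + g = q_i + sum_j p_ij v_j, with v_n = 0.
        A, b = [], []
        for i in range(n):
            row, q = alternatives[i][policy[i]]
            coeffs = [-row[j] for j in range(n - 1)]
            if i < n - 1:
                coeffs[i] += 1.0
            coeffs.append(1.0)            # coefficient of the gain g
            A.append(coeffs)
            b.append(q)
        sol = solve(A, b)
        v, g = sol[:-1] + [0.0], sol[-1]
        # Policy improvement: best test quantity in each state, keeping the
        # old decision when it ties for best.
        new_policy = []
        for i in range(n):
            tests = [q + sum(p * vj for p, vj in zip(row, v))
                     for row, q in alternatives[i]]
            best = max(tests)
            old = policy[i]
            new_policy.append(old if tests[old] >= best - 1e-12
                              else tests.index(best))
        if new_policy == policy:
            return policy, g, v
        policy = new_policy
    raise RuntimeError("policy iteration did not converge")


# Illustrative data (assumed): two states, two alternatives in each.
alternatives = [
    [([0.5, 0.5], 6.0), ([0.8, 0.2], 4.0)],
    [([0.4, 0.6], -3.0), ([0.7, 0.3], -5.0)],
]
policy, gain, values = policy_iteration(alternatives)
```

For these numbers the cycle stops after two value determinations with
the second alternative chosen in both states, a gain of 2, and relative
values of 10 and 0. In the partially observable problem the 51 grid
points of s1(n) take the place of the discrete states.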
PROGRAM LISTING - "POLICY"


      DIMENSION NA(52),K(51),P(153,51),Q(3,51),V(51),A(51,51),B(51)
      COMMON NA,K,P,Q,V,A,B
C     INITIAL POLICY VECTOR
      NA(52) = 153
      DO 1 I = 1,20
    1 K(I) = 2
      DO 2 I = 21,25
    2 K(I) = 3
      DO 3 I = 26,51
    3 K(I) = 1
   27 FORMAT (11F9.5)
   51 FORMAT (6F9.5)
C     FORM Q VECTOR
    4 FORMAT (2F6.3)
   52 FORMAT (7F6.3)
      DO 53 J = 1,51
   53 READ 4, Q(1,J),Q(2,J)
      DO 5 J = 1,51
    5 Q(3,J) = 0.0
   28 FORMAT (2H Q)
C     FORM P VECTOR
    6 FORMAT (4F6.5)
      DO 7 I = 1,153
      DO 7 J = 1,51
    7 P(I,J) = 0.0
    9 FORMAT (17F4.3)
      DO 200 I = 1,153
  200 READ 9, (P(I,J),J = 1,51)
C     FORM NA(I)
      NA(1) = 0
      DO 10 I = 2,51
      J = I - 1
   10 NA(I) = NA(J) + 3
C     FORM A AND B VECTORS
  102 V(51) = 0.0
      DO 11 I = 1,51
      L = K(I)
      LL = NA(I) + K(I)
      B(I) = Q(L,I)
      A(I,51) = 1.0
      DO 11 J = 1,50
   11 A(I,J) = (-P(LL,J))
      DO 12 I = 1,50
   12 A(I,I) = A(I,I) + 1.0
C     CALL LINEAR EQN SUBROUTINE
      SCALE = 1.0
   50 FORMAT (3F9.5)
      M = XSIMEQF(51,51,1,A,B,SCALE,V)
C     PULL RESULTS
      GO TO (13,14,14), M
   13 DO 16 I = 1,51
   16 V(I) = A(I,1)
      G = A(51,1)
      V(51) = 0.0
C     POLICY IMPROVEMENT
      ITER = 1
      DO 100 I = 1,51
      TEMP = -99999.
      NTEMP = 0
      IMIN = NA(I) + 1
      IMAX = NA(I + 1)
      DO 17 M = IMIN,IMAX
      KALT = M - NA(I)
      TEST = Q(KALT,I)
      DO 18 J = 1,51
   18 TEST = TEST + P(M,J)*V(J)
      IF (TEST - TEMP) 17,20,21
   21 NTEMP = KALT
      TEMP = TEST
      GO TO 17
   20 IF (NTEMP - K(I)) 21,17,21
   17 CONTINUE
      IF (NTEMP - K(I)) 22,100,22
   22 ITER = 2
      K(I) = NTEMP
  100 CONTINUE
      GO TO (101,102), ITER
C     PRINT OUTPUT
  101 PRINT 23
   23 FORMAT (16H DECISION VECTOR)
   24 FORMAT (51I2)
      PRINT 24, (K(I),I = 1,51)
      PRINT 25
   26 FORMAT (F9.5)
      PRINT 26, G
   25 FORMAT (16H GAIN AND VALUES)
   70 FORMAT (11F9.5)
   71 FORMAT (7F9.5)
      PRINT 70, (V(I),I = 1,11)
      PRINT 70, (V(I),I = 12,22)
      PRINT 70, (V(I),I = 23,33)
      PRINT 70, (V(I),I = 34,44)
      PRINT 71, (V(I),I = 45,51)
      GO TO 103
   14 PRINT 15
   15 FORMAT (23H NO SOLUTION FOR VALUES)
  103 CONTINUE
      CALL EXIT
      END